Abstract

With  the  vast  number  of  information  resources 
available today, a critical problem is how to locate,
retrieve and process information. It would
be  impractical  to  build  a  single  unified  system 
that  combines  all  of  these  information  resources. 

A  more  promising  approach  is  to  build  special¬ 
ized  information  retrieval  agents  that  provide  ac¬ 
cess  to  a  subset  of  the  information  resources 
and can send requests to other information retrieval
agents when appropriate. In this paper
we  present  the  architecture  of  the  individual  in¬ 
formation  retrieval  agents  and  describe  how  this 
architecture  supports  a  network  of  cooperating 
information  agents.  We  describe  how  these  infor¬ 
mation  agents  represent  their  knowledge,  commu¬ 
nicate  with  other  agents,  dynamically  construct 
information  retrieval  plans,  and  learn  about  other 
agents  to  improve  efficiency.  We  have  already 
built  a  small  network  of  agents  that  have  these 
capabilities  and  provide  access  to  information  for 
transportation  planning. 

Introduction 

With  the  expanding  amount  of  information  avail¬ 
able,  the  problem  of  how  to  combine  distributed, 
heterogeneous  information  sources  becomes  more 
and  more  critical.  The  available  information 
sources  include  traditional  databases,  flat  files, 
knowledge bases, programs, etc. Traditional approaches
to building distributed or federated systems
do not scale well to the large, diverse, and
growing number of information sources. Recent
Internet systems such as Mosaic, WAIS, and Gopher
allow users to search through large numbers
of information sources, but provide very limited
capabilities for locating, combining, processing,
and organizing information.

*The research reported here was supported in part
by Rome Laboratory of the Air Force Systems Command
and the Advanced Research Projects Agency
under contract no. F30602-91-C-0081, and in part by
the National Science Foundation under grant number
IRI-9313993. The views and conclusions contained in
this report are those of the authors and should not be
interpreted as representing the official opinion or policy
of RL, ARPA, NSF, the U.S. Government, or any
person or agency connected with them.

A  promising  approach  to  this  problem  is  to  pro¬ 
vide  access  to  the  large  number  of  information 
sources by organizing them into a network of information
agents [Papazoglou et al., 1992]. The
goal  of  each  agent  is  to  provide  information  and 
expertise  on  a  specific  topic  by  drawing  on  rele¬ 
vant  information  from  other  information  agents. 
Similar  to  the  way  current  information  sources 
are  independently  constructed,  these  information 
agents  can  be  developed  and  maintained  sepa¬ 
rately,  drawing  on  the  other  available  information 
agents  and  providing  a  new  information  source 
that others can then build upon. Each information
agent  is  another  information  source,  but  provides 
a  desperately  needed  abstraction  of  the  many  in¬ 
formation  sources  available.  Similarly,  an  informa¬ 
tion  source,  such  as  a  database  or  program,  can  be 
turned  into  a  simple  information  agent  by  build¬ 
ing  the  appropriate  interface  code  around  the  in¬ 
formation  source.  Given  this  simple  mapping  be¬ 
tween  information  agent  and  information  source, 
we will use these terms interchangeably throughout
the rest of this paper.

Figure  1  shows  an  example  network  of  infor¬ 
mation  retrieval  agents.  This  network  includes 
agents such as a USC computer science technical
report  agent,  which  is  an  agent  that  only  provides 
information  about  and  access  to  the  department 
technical  reports.  This  agent  is  in  turn  used  to 
construct both a computer science technical report
agent that spans multiple universities and
a USC technical report agent that spans departments
within the university. These agents could in
turn  be  used  to  construct  other  agents,  and  so  on. 
An  important  feature  of  this  organization  is  that 
the  individual  agents  can  be  independently  built 
and  maintained.  This  makes  it  possible  to  scale 
the  architecture  to  large  numbers  of  information 
sources. 

To  build  a  network  of  specialized  informa¬ 
tion  agents,  we  need  an  architecture  for  a  single 
agent  that  can  be  instantiated  to  provide  multiple 
agents. In previous work we developed an information
server, called SIMS [Arens et al., 1993], which
provides access to heterogeneous data and knowledge
bases. In addition, Pastor et al. [1992] developed
a system called the Loom Interface Manager
(LIM), which we use to build agents for accessing
individual relational databases. Using SIMS and
LIM, we have built a small network of information
retrieval agents that interact over the Internet.
We are now in the process of building a more
ambitious network of interacting agents.

Figure 1: Network of Information Retrieval Agents

The  SIMS  architecture  is  shown  in  Figure  2. 
Each  SIMS  agent  contains  a  detailed  model  of  its 
domain  of  expertise  and  models  of  the  information 
sources  that  are  available  to  it.  Given  an  informa¬ 
tion  request,  an  agent  selects  an  appropriate  set  of 
information  sources,  generates  a  plan  to  retrieve 
and  process  the  data,  uses  knowledge  about  the 
information  sources  to  reformulate  the  plan  for  ef¬ 
ficiency,  and  then  executes  it. 

This  paper  presents  the  design  of  an  individ¬ 
ual  SIMS  agent  and  discusses  the  issues  that  arise 
in  using  this  design  to  build  a  network  of  cooper¬ 
ating  information  retrieval  agents.  First,  we  de¬ 
scribe  how  the  knowledge  of  an  agent  is  repre¬ 
sented.  Second,  we  describe  how  the  agents  ex¬ 
change  queries  and  data  with  one  another.  Third, 
we  describe  how  information  requests  are  flexibly 
and  efficiently  processed.  Fourth,  we  describe  how 
the  system  learns  about  the  other  agents  in  order 
to  improve  performance  over  time.  Fifth,  we  iden¬ 
tify  how  the  different  features  of  this  design  sup¬ 
port  flexible,  efficient,  and  modular  agents.  Sixth, 
we  describe  the  closely  related  work  in  this  area. 
Finally,  we  conclude  with  a  discussion  of  the  cur¬ 
rent  status  and  the  work  that  remains  to  be  done. 


Representing  an  Agent’s  Knowledge 

Each  agent  contains  a  model  of  its  domain  of  ex¬ 
pertise  and  models  of  the  other  agents  and  infor¬ 
mation  sources  that  can  provide  relevant  informa¬ 
tion.  We  will  refer  to  these  two  types  of  models  as 
the domain model and information source models.
The domain model provides descriptions of
the classes of objects in the domain, relationships
between these classes (e.g., subclass and superclass),
relations on each class, and other domain-specific
information. The information source models
describe both the contents of the information
sources and the relationship between those models
and the domain model. Both the domain and
information source models are expressed in the
Loom knowledge representation language [MacGregor,
1990]. Loom is an AI knowledge representation
system based on KL-ONE [Brachman and
Schmolze, 1985]. Loom provides a language for
representing  hierarchies  of  classes  and  relations, 
as  well  as  efficient  mechanisms  for  classifying  in¬ 
stances  of  classes  and  reasoning  about  descriptions 
of  object  classes. 

The domain and information source models constitute
the general knowledge of an agent and are
used  to  determine  how  to  process  an  information 
request.  The  domain  model  of  an  agent  defines  its 
area  of  expertise  and  the  terminology  for  interact¬ 
ing  with  that  agent.  The  information  source  mod¬ 
els  describe  the  resources  that  are  available  to  an 
agent  to  answer  information  requests.  These  mod¬ 
els  do  not  need  to  contain  a  complete  description 
of  another  agent  or  information  source,  but  rather 
only  the  portions  of  those  agents  or  information 
sources that are directly relevant. Specializing the
agents for specific areas provides a modular organization
of the vast number of information sources
and provides a clear delineation of the types of
queries each agent can handle. In complex domains,
the domain can be broken down into meaningful
subparts and an information agent can be
built for each subpart.

Figure 2: The Architecture of an Agent

Domain  Models 

Each  information  agent  is  specialized  to  a  single 
“application  domain”  and  provides  access  to  the 
available  information  sources  within  that  domain. 
The  largest  application  domain  that  we  have  to 
date  is  a  transportation  planning  domain,  which 
involves  information  about  the  movement  of  per¬ 
sonnel  and  materiel  from  one  location  to  another 
using  aircraft,  ships,  trucks,  etc. 

As  described  above,  the  application  domain 
models are defined in the Loom knowledge representation
system. This provides a semantic description
of the objects and relations in a domain,
which is used extensively for processing
queries.  Figure  3  shows  a  fragment  of  the  domain 
model  in  the  transportation  planning  domain.  In 
this  figure,  the  nodes  represent  classes  of  objects, 
the  thick  arrows  represent  subclass  relationships, 
and  the  thin  arrows  represent  relations  between 
classes. 

The  classes  defined  in  the  domain  model  do  not 
necessarily  correspond  directly  to  the  objects  de¬ 
scribed  in  any  particular  information  source.  The 
domain  model  is  intended  to  be  a  description  of 
the  application  domain  from  the  point  of  view  of 
users  or  other  information  agents  that  may  need 
to  obtain  information  about  the  application  do¬ 
main.  The  terms  in  the  domain  model  provide  the 
language  to  define  the  contents  of  an  information 
source  to  the  agent. 


Modeling  Information  Sources 

The  critical  part  of  the  information  source  mod¬ 
els  is  the  description  of  the  contents  of  the  infor¬ 
mation  sources.  This  consists  of  a  description  of 
the  classes  contained  in  the  information  source, 
as  well  as  the  relationship  between  these  classes 
and  the  classes  in  the  domain  model.  The  map¬ 
pings  between  the  domain  model  and  the  infor¬ 
mation  source  model  are  needed  for  transforming 
a domain-level query into a set of queries to actual
information  sources. 

Figure  4  illustrates  how  an  information  source 
is  modeled  in  Loom  and  how  it  is  related  to  the 
domain  model.  All  of  the  concepts  and  relations  in 
the  information  source  model  are  mapped  to  con¬ 
cepts  and  relations  in  the  domain  model.  A  map¬ 
ping  link  between  two  concepts  indicates  that  they 
represent  the  same  class  of  information.  Thus,  if 
the  user  requests  all  seaports,  that  information  can 
be  retrieved  from  the  GEO  agent,  which  has  infor¬ 
mation  about  seaports. 
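As a rough illustration of the mapping links described above, the following Python sketch stores a toy domain model and toy information source models as dictionaries. It is not Loom and not the SIMS implementation; the names `DOMAIN_MODEL`, `SOURCE_MODELS`, `sources_for`, and the source concepts such as `geo.seaports` are hypothetical.

```python
# Toy stand-in for domain and information source models.
# All identifiers are illustrative assumptions, not SIMS or Loom APIs.

# Domain model: concept -> superclass (None for a root concept).
DOMAIN_MODEL = {
    "port": None,
    "seaport": "port",
    "airport": "port",
    "channel": None,
}

# Information source models: source concept -> (agent, mapped domain concept).
SOURCE_MODELS = {
    "geo.seaports": ("GEO", "seaport"),
    "geo.channels": ("GEO", "channel"),
    "port.ports": ("PORT", "seaport"),
}


def sources_for(domain_concept):
    """Return the (agent, source concept) pairs linked to a domain concept."""
    return sorted(
        (agent, source_concept)
        for source_concept, (agent, concept) in SOURCE_MODELS.items()
        if concept == domain_concept
    )
```

With these toy models, a request for seaports can be answered by either the GEO or the PORT agent, while no source maps directly to the more general concept `port`.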

Communication 

Queries  to  an  information  agent  are  expressed  in 
the  Loom  query  language.  These  queries  are  com¬ 
posed  of  terms  in  a  general  domain  model,  so  there 
is  no  need  for  other  agents  or  a  user  to  know  or 
even  be  aware  of  the  terms  used  in  the  underlying 
information  sources.  Given  a  query,  an  informa¬ 
tion  agent  identifies  the  appropriate  information 
sources  and  issues  queries  to  those  sources  to  ob¬ 
tain  the  requisite  data  for  answering  the  query. 
To  do  this,  an  information  agent  translates  the 
domain-level query into a set of queries to more
specialized  information  agents  using  the  terms  ap¬ 
propriate  to  each  of  those  agents. 

The  queries  to  other  agents  are  also  expressed 
in the Loom query language. Making
an existing database or other application program
available to the network of agents requires building
a wrapper around the existing system to turn
it into an agent with access to that information
source. Note that only one such wrapper would
need to be built for any given type of information
source (e.g., relational database, object-oriented
database, flat file, etc.). The advantage of this approach
is that it greatly simplifies the individual
agents, since each only needs to handle one underlying
language. This makes it possible to scale the
network to many agents with access to many different
types of information sources.

Figure  5  illustrates  a  query  expressed  in  the 
Loom  language.  This  query  requests  all  seaports 
and the corresponding ships that can be accommodated
within each port. The first argument
to the retrieve expression is the parameter list,
which specifies which parameters of the query to
return. The second argument is a description of
the information to be retrieved. This description
is expressed as a conjunction of concept and relation
expressions, where the concepts describe the
classes of information, and the relations describe
the constraints on these classes. The first clause
of the query is an example of a concept expression
and specifies that the variable ?port describes a
member of the class seaport. The second clause is
an example of a relation expression and states that
the relation port_name holds between the value of
?port and the variable ?port_name. This query
requests all seaport and ship pairs where the depth
of the port exceeds the draft of the ship.

(retrieve (?port_name ?depth ?ship_type ?draft)
  (:and (seaport ?port)
        (port_name ?port ?port_name)
        (channel_of ?port ?channel)
        (channel_depth ?channel ?depth)
        (ship ?ship)
        (vehicle_type ?ship ?ship_type)
        (max_draft ?ship ?draft)
        (> ?depth ?draft)))

Figure  5:  Example  Loom  Query 
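To make the semantics of such a conjunctive query concrete, here is a minimal Python sketch that evaluates the clauses of Figure 5 against a handful of invented facts by backtracking over variable bindings. The evaluator, the fact format, and the data (ports "Port A" and "Port B", the ships) are assumptions for illustration only; Loom's own retrieval machinery is not modeled.

```python
def is_var(term):
    """Variables are strings that start with '?', as in the Loom query."""
    return isinstance(term, str) and term.startswith("?")


def evaluate(clauses, facts, binding=None):
    """Yield every variable binding that satisfies all clauses in order."""
    binding = binding or {}
    if not clauses:
        yield dict(binding)
        return
    (pred, *args), rest = clauses[0], clauses[1:]
    if pred == ">":
        # Comparison clauses must come after the clauses that bind their variables.
        left, right = (binding.get(a, a) for a in args)
        if left > right:
            yield from evaluate(rest, facts, binding)
        return
    for fact in facts:
        if fact[0] != pred or len(fact) != len(args) + 1:
            continue
        new = dict(binding)
        matches = True
        for term, value in zip(args, fact[1:]):
            if is_var(term):
                if new.setdefault(term, value) != value:
                    matches = False
                    break
            elif term != value:
                matches = False
                break
        if matches:
            yield from evaluate(rest, facts, new)


def retrieve(params, clauses, facts):
    """Return the sorted tuples of parameter values for all solutions."""
    return sorted(tuple(b[p] for p in params) for b in evaluate(clauses, facts))


# Invented toy facts: (predicate, argument, ...).
FACTS = [
    ("seaport", "p1"), ("port_name", "p1", "Port A"),
    ("channel_of", "p1", "c1"), ("channel_depth", "c1", 40),
    ("seaport", "p2"), ("port_name", "p2", "Port B"),
    ("channel_of", "p2", "c2"), ("channel_depth", "c2", 15),
    ("ship", "s1"), ("vehicle_type", "s1", "tanker"), ("max_draft", "s1", 30),
    ("ship", "s2"), ("vehicle_type", "s2", "barge"), ("max_draft", "s2", 10),
]

QUERY = [
    ("seaport", "?port"),
    ("port_name", "?port", "?port_name"),
    ("channel_of", "?port", "?channel"),
    ("channel_depth", "?channel", "?depth"),
    ("ship", "?ship"),
    ("vehicle_type", "?ship", "?ship_type"),
    ("max_draft", "?ship", "?draft"),
    (">", "?depth", "?draft"),
]

ROWS = retrieve(["?port_name", "?depth", "?ship_type", "?draft"], QUERY, FACTS)
```

On this toy data the tanker (draft 30) fits only Port A (depth 40), while the barge fits both ports.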

In  addition  to  sending  queries  to  other  agents, 
the  agents  also  need  the  capability  to  send  back 
objects in response to their queries. Communication
between information agents is done using
the Knowledge Query and Manipulation Language
(KQML) [Finin et al., 1992]. KQML is an agent
communication language that handles the interface
protocols for transmitting queries, returning
the appropriate information, and building the appropriate
internal structures. We currently use
this language to send queries between a SIMS
agent and the LIM agents, which provide access
to relational databases [Pastor et al., 1992].

Processing  an  Information  Request 

The  core  contribution  of  an  information  agent  is 
the  ability  to  intelligently  retrieve  and  process 
data.  Information  sources  are  constantly  chang¬ 
ing;  new  information  becomes  available,  old  infor¬ 
mation  may  be  eliminated  or  temporarily  unavail¬ 
able,  and  so  on.  Thus,  an  agent  needs  the  capa¬ 
bilities  to  dynamically  select  an  appropriate  set  of 
information sources, construct a plan for retrieving
and processing the information, and optimize
this plan to ensure that the data is retrieved efficiently.
This section describes each of these processing
steps in turn.

Figure 6: Fragment of the Domain and Information Source Models

Information  Source  Selection 

The  first  step  in  answering  a  query  expressed  in 
the  terms  of  the  domain  model  is  to  select  the 
appropriate  information  sources.  This  is  done  by 
mapping  from  the  concepts  in  the  domain  model 
to  the  concepts  in  the  information  source  models. 
If  the  user  requests  information  about  ports  and 
there  is  a  single  information  agent  that  provides 
access  to  information  on  ports,  then  the  map¬ 
ping  is  straightforward.  However,  in  some  cases 
there  may  be  several  agents  that  provide  access  to 
the  same  information  and  in  other  cases  no  single 
agent  can  provide  the  required  information  and  it 
will need to be drawn from several different agents.
The  process  of  selecting  the  information  sources  is 
performed  by  reformulating  the  terms  in  the  orig¬ 
inal  query  into  the  terms  that  correspond  to  the 
available  information  sources. 

Consider  the  fragment  of  the  knowledge  base 
shown  in  Figure  6,  which  covers  the  knowledge 
relevant to the example query in Figure 5. The
concepts  Seaport,  Channel  and  Ship  have  sub¬ 
concepts,  shown  by  the  shaded  circles,  that  cor¬ 
respond  to  concepts  whose  instances  can  be  re¬ 
trieved  from  some  information  agent.  Thus,  the 
GEO agent contains information about both seaports
and channels, and the PORT agent contains
information about only seaports. If the user
asks for seaports, then the query must be translated
into one of the information source concepts:
seaports from the GEO agent or ports from
the PORT agent.

In order to select the information sources
for answering a query, an agent applies a
set of reformulation operators to transform
the domain-level concepts into concepts that
can be retrieved directly from an information
source. The system has a number of
truth-preserving reformulation operations that
can be used for this task. The operations
include Select-Information-Source, Generalize-Concept,
Specialize-Concept, Partition-Concept,
and Decompose-Relation. These operations are
described briefly below.

Select-Information-Source maps a domain-level
concept directly to an information-source-level
concept. In many cases this will simply be
a direct mapping from a concept such as Seaport
to a concept that corresponds to the seaports in
some information source. There may be multiple
information sources that contain the same information,
in which case the domain-level query can
be reformulated in terms of any one of the information
source concepts. In general, the choice is
made so as to minimize the number of different
information sources used to answer a query.
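The selection criterion just described (use as few distinct sources as possible) can be sketched in a few lines of Python. This is an exhaustive search over candidate assignments, chosen here only for clarity; the paper does not say how SIMS implements the choice, and `select_sources` and `CANDIDATES` are hypothetical names.

```python
from itertools import product


def select_sources(concepts, candidates):
    """Pick one agent per concept, minimizing the number of distinct agents.

    Exhaustive search, so only suitable for a handful of concepts. Raises
    ValueError if some concept has no candidate agent at all.
    """
    options = [candidates[concept] for concept in concepts]
    best = min(product(*options), key=lambda combo: len(set(combo)))
    return dict(zip(concepts, best))


# Which agents can supply each concept (mirrors the GEO/PORT example).
CANDIDATES = {
    "seaport": ["PORT", "GEO"],
    "channel": ["GEO"],
}

CHOICE = select_sources(["seaport", "channel"], CANDIDATES)
```

Because channels are available only from GEO, choosing GEO for seaports as well keeps the query within a single agent.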

Generalize-Concept uses knowledge about
the  relationship  between  a  class  and  a  superclass 
to  reformulate  a  query  in  terms  of  the  more  gen¬ 
eral  concept.  In  order  to  preserve  the  semantics  of 
the  original  request,  one  or  more  additional  con¬ 
straints  may  need  to  be  added  to  the  query  in 
order  to  avoid  retrieving  extraneous  data.  For  ex¬ 
ample,  if  a  query  requires  some  information  about 
airports,  but  the  information  sources  that  corre¬ 
spond  to  the  airport  concept  do  not  contain  the 
requested  information,  then  it  may  be  possible  to 
generalize airport to port and retrieve the information
from some information source that contains
port  information.  In  order  to  ensure  that  no  ex¬ 
traneous  data  is  returned,  the  reformulation  will 
include  a  join  between  airport  and  port. 

Specialize-Concept replaces a concept with a
more specific concept by checking the constraints
on the query to see if there is an appropriate specialization
of the requested concept that would
satisfy it. For example, if a query requests all ports
with an elevation greater than 1000 feet, it may be
possible to reformulate this in terms of all airports
with an elevation greater than 1000 feet since there
are no seaports with an elevation this high. Even
if there were an information source corresponding
to the port concept, this may be a more efficient
way to retrieve the data. Range information such
as this is naturally represented and stored as part
of the domain model.
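A minimal Python sketch of this use of range information follows. The elevation ranges are invented numbers, and `SUBCONCEPTS`, `ELEVATION_RANGE`, and `specialize` are hypothetical names; the sketch only shows how a stored range can rule out subconcepts that cannot satisfy a constraint.

```python
# Hypothetical domain knowledge: subconcepts of each concept, and the
# (min, max) elevation in feet recorded for each subconcept.
SUBCONCEPTS = {"port": ["seaport", "airport"]}
ELEVATION_RANGE = {"seaport": (0, 50), "airport": (0, 14000)}


def specialize(concept, attribute_ranges, lower_bound):
    """Return the subconcepts that may hold instances with value > lower_bound."""
    return [
        sub
        for sub in SUBCONCEPTS.get(concept, [])
        if attribute_ranges[sub][1] > lower_bound
    ]
```

Under these assumed ranges, a query for ports above 1000 feet specializes to airports only, since no seaport can satisfy the constraint.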

Partition-Concept  uses  knowledge  about  set 
coverings  (a  set  of  concepts  that  include  all  of  the 
instances  of  another  concept)  to  specialize  a  con¬ 
cept.  This  information  is  used  to  replace  a  re¬ 
quested  concept  with  a  set  of  concepts  that  cover 
it.  For  example,  if  a  query  requests  information 
about ports and there is no information source
that covers ports, it may be possible to reformulate
the  query  into  a  set  of  subqueries  that  cover  ports. 
If  ports  are  covered  by  seaports,  airports,  and  rail 
ports,  then  the  original  query  can  be  replaced  by 
queries  on  each  of  these  subconcepts. 
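The covering idea can be sketched as follows in Python. `COVERINGS` and `partition` are hypothetical names, and the sketch assumes a covering is usable only when every subconcept in it has an available source.

```python
# Hypothetical set covering: every port is a seaport, airport, or rail port.
COVERINGS = {"port": ["seaport", "airport", "rail_port"]}


def partition(concept, available):
    """Return the concepts to query in place of `concept`, or None.

    If the concept itself has a source, it is used directly. Otherwise the
    covering subconcepts are returned, provided all of them have sources.
    """
    if concept in available:
        return [concept]
    cover = COVERINGS.get(concept)
    if cover and all(sub in available for sub in cover):
        return list(cover)
    return None
```

The answer to the original query is then the union of the answers to the subqueries on the returned concepts.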

Decompose-Relation replaces a relation defined
between concepts in the domain model with
equivalent terms that are available in the information
source models. For example, channel_of is
a property of the domain model, but it is not defined
in any information source. Instead, it can be
replaced by joining over a key, geoloc_code, that
in this case happens to occur in both seaport and
channel.
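The replacement of channel_of by a join on the shared key can be illustrated with a short Python sketch. The row format (dictionaries with `id` and `geoloc_code` fields) and the function name `channel_of` are assumptions made for this example.

```python
from collections import defaultdict


def channel_of(seaports, channels):
    """Derive (port id, channel id) pairs by joining on geoloc_code."""
    by_code = defaultdict(list)
    for channel in channels:
        by_code[channel["geoloc_code"]].append(channel["id"])
    return [
        (port["id"], channel_id)
        for port in seaports
        for channel_id in by_code.get(port["geoloc_code"], [])
    ]


# Invented rows for illustration.
SEAPORTS = [
    {"id": "p1", "geoloc_code": "AAAA"},
    {"id": "p2", "geoloc_code": "BBBB"},
]
CHANNELS = [
    {"id": "c1", "geoloc_code": "AAAA"},
    {"id": "c2", "geoloc_code": "AAAA"},
    {"id": "c3", "geoloc_code": "CCCC"},
]
```

Here port p1 is paired with channels c1 and c2, which share its code, while p2 has no matching channel.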

Reformulation  is  performed  by  treating  the  re¬ 
formulation  operators  as  a  set  of  transformation 
operators  and  then  using  a  planning  system  to 
search  for  a  reformulation  of  the  given  query  de¬ 
scription.  The  planner  searches  for  a  mapping 
from  each  of  the  concepts  and  relations  in  the 
query  into  concepts  and  relations  for  which  data 
is  available. 

For example, consider the query shown in Figure 5.
There are two concept expressions, one
about ships and the other about seaports. In the
first step, the system attempts to translate the
seaport expression into an information-source-level
expression. Unfortunately, none of the information
sources contain information that corresponds
to channel_of. Thus, the system must reformulate
channel_of using the decompose operator.
This expresses the fact that channel_of is equivalent
to performing a join over the keys for the
seaport and channel concepts. The resulting reformulation
is shown in Figure 7.

The next step reformulates the seaport portion
of the query into a corresponding information
source query. This can be done using the
select-information-source operator, which selects
between the GEO and PORT information agents.
In this case GEO is selected because the information
on channels is only available in the GEO
agent. The channel and ship portions of the query
are then similarly reformulated. The final query,
which is the result of reformulating the entire
query, is shown in Figure 8.

(retrieve (?port_name ?depth ?ship_type ?draft)
  (:and (seaport ?port)
        (port_name ?port ?port_name)
        (geoloc_code ?port ?geocode)
        (channel ?channel)
        (geoloc_code ?channel ?geocode)
        (channel_depth ?channel ?depth)
        (ship ?ship)
        (vehicle_type ?ship ?ship_type)
        (range ?ship ?range)
        (> ?range 10000)
        (max_draft ?ship ?draft)
        (> ?depth ?draft)))

Figure 7: Result of Applying the Decompose Operator
to Eliminate channel_of

(retrieve (?port_name ?depth ?ship_type ?draft)
  (:and (seaports ?port)
        (seaports.port_nm ?port ?port_name)
        (seaports.glc_cd ?port ?glc_cd)
        (channels ?channel)
        (channels.glc_cd ?channel ?glc_cd)
        (channels.ch_depth_ft ?channel ?depth)
        (notional_ship ?ship)
        (notional_ship.sht_nm ?ship ?ship_type)
        (notional_ship.range ?ship ?range)
        (> ?range 10000)
        (notional_ship.max_draft ?ship ?draft)
        (< ?draft ?depth)))

Figure 8: Result of Selecting Information Sources
for Channels and Ships

Query  Access  Planning 

Once  the  system  has  reformulated  the  query  so 
that  it  uses  only  terms  from  its  information  source 
models,  the  next  step  is  to  generate  a  query  plan 
for  retrieving  and  processing  the  data.  The  query 
plan  specifies  the  operations  for  processing  the 
data,  as  well  as  the  order  in  which  to  perform  these 
operations. 

There  may  be  a  significant  difference  in  effi¬ 
ciency  between  different  plans  for  a  query.  There¬ 
fore,  the  planner  searches  for  a  plan  that  can  be 
implemented  as  efficiently  as  possible.  To  do  this 
the  planner  must  take  into  account  the  cost  of  ac¬ 
cessing  the  different  information  sources,  the  cost 
of  retrieving  intermediate  results,  and  the  cost  of 
combining  these  intermediate  results  to  produce 
the  final  results.  In  addition,  since  the  information 
sources  are  distributed  over  different  machines  or 
even  different  sites,  the  planner  takes  advantage 
of  potential  parallelism  and  generates  subqueries 
that  can  be  issued  concurrently. 


Figure 9: Parallel Query Access Plan


There are five general operators that are used
to plan out the processing of a query:

•  Move  -  Moves  a  set  of  data  from  one  information 
source  to  another  information  source. 

•  Join  -  Combines  two  sets  of  data  into  a  com¬ 
bined  set  of  data  using  the  given  join  relations. 

•  Retrieve  -  Specifies  the  data  that  is  to  be  retrieved
from  a  particular  information  source.

•  Select  -  Selects  a  subset  of  the  data  using  the 
given  constraints. 

•  Assign  -  Constructs  a  new  term  in  the  data  from 
some  combination  of  the  existing  data. 

Each  of  these  operations  manipulates  one  or  more 
sets  of  data,  where  the  data  is  specified  in  the  same 
terms  that  are  used  for  communicating  with  SIMS. 
This  simplifies  the  input/output  since  there  is  no 
conversion  between  languages. 

The  planner  is  implemented  in  a  version  of 
UCPOP [Barrett et al., 1993] that has been
modified  to  generate  parallel  execution  plans 
[Knoblock,  1994].  The  system  searches  through 
the  space  of  possible  plans  using  a  best-first  search 
until  a  complete  plan  is  found. 

The  plan  generated  for  the  example  query  in 
Figure  8  is  shown  in  Figure  9.  In  this  example, 
the system partitions the given query such that
the  ship  information  is  retrieved  in  a  single  query 
to  the  ASSETS  agent  and  the  seaport  and  chan¬ 
nel  information  is  retrieved  in  a  single  query  to 
the  GEO  agent.  All  of  the  information  is  brought 
into  the  local  system  (Loom)  where  the  draft  of 
the  ships  can  be  compared  against  the  depth  of 
the  seaports.  Once  the  final  set  of  data  has  been 
generated,  it  is  returned  by  the  agent. 

The  planner  attempts  to  minimize  the  overall 
execution  time  by  searching  for  a  query  that  can  be 
implemented  as  efficiently  as  possible.  It  does  this 
by using a simple estimation function to calculate
the expected cost of the various operations and
then  selecting  a  plan  that  has  the  lowest  overall 
parallel  execution  cost.  In  the  example,  the  agent 
leaves  the  join  between  the  seaports  and  channels 
to  be  performed  by  the  remote  GEO  agent  since 
this  will  be  cheaper  than  moving  the  information 
into  the  local  system.  If  the  system  could  per¬ 
form  all  of  the  work  in  one  remote  system,  then  it 
would  completely  bypass  the  local  agent  and  re¬ 
turn  the  data  directly  to  the  agent  that  requested 
the  information.  Once  an  execution  plan  has  been 
produced,  it  is  sent  to  the  reformulation  system 
for  global  optimization,  as  described  in  the  next 
section. 
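The notion of parallel execution cost can be illustrated with a small Python sketch. The paper does not give its estimation function, so the cost rule used here (a step finishes after its slowest input plus its own cost) and all cost numbers are assumptions; `Step`, `parallel_cost`, and `sequential_cost` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One plan operation with an assumed cost and the steps it waits on."""
    name: str
    cost: float
    inputs: list = field(default_factory=list)


def parallel_cost(step):
    """Elapsed cost when independent inputs are executed concurrently."""
    return step.cost + max((parallel_cost(s) for s in step.inputs), default=0.0)


def sequential_cost(step):
    """Elapsed cost when every step is executed one after another."""
    return step.cost + sum(sequential_cost(s) for s in step.inputs)


# A plan shaped like the example: two remote retrievals moved to the local
# system and joined there. The cost values are invented.
PLAN = Step("join in local system", 2, [
    Step("move from GEO", 1, [Step("retrieve seaports and channels", 5)]),
    Step("move from ASSETS", 1, [Step("retrieve ships", 3)]),
])
```

With these numbers the two retrievals overlap, so the parallel estimate is 2 + max(6, 4) = 8, compared with 12 if the same steps were run one at a time.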

Semantic  Query  Reformulation 

The  goal  of  the  semantic  query  reformulation  is 
to  use  reformulation  to  search  for  the  least  ex¬ 
pensive  query  in  the  space  of  semantically  equiv¬ 
alent  queries.  The  reformulation  from  one  query 
to  another  is  done  through  logical  inference  using 
information-source  abstractions,  the  abstracted 
knowledge  of  the  contents  of  relevant  information 
sources.  See  [Hsu  and  Knoblock,  1993a]  for  an  ex¬ 
planation  of  how  rules  like  these  are  automatically 
learned.  The  information-source  abstractions  de¬ 
scribe  the  information  in  terms  of  a  set  of  closed 
formulas  of  first-order  logic.  These  formulas  de¬ 
scribe  an  information  source  in  the  sense  that  they 
are  true  with  regard  to  all  instances  in  the  infor¬ 
mation  source. 

Consider  the  example  shown  in  Figure  10.  The 
input  query  retrieves  ship  types  whose  ranges  are 
greater  than  10,000  miles.  This  query  could  be 
expensive  to  evaluate  because  there  is  no  index 
placed  on  the  range  attribute.  The  system  must 
scan all of the instances of notional_ship and
check  the  values  of  the  range  to  retrieve  the  an¬ 
swer. 

Input Query:

(retrieve (?ship_type ?ship ?draft)
  (:and (notional_ship ?ship)
        (notional_ship.sht_nm ?ship ?ship_type)
        (notional_ship.max_draft ?ship ?draft)
        (notional_ship.range ?ship ?range)
        (> ?range 10000)))

Figure 10: Example Subquery

A set of applicable rules for this query is shown
in Figure 11. These rules would either be learned
by the system or provided as semantic integrity
constraints about an information source. Rule
R1 states that for all ships with maximum drafts
greater than 10 feet, their range is greater than
12,000 miles. Rule R2 states that all ships with
range greater than 10,000 miles have fuel capacities
greater than 5,000 gallons. The last rule, R3,
simply states that the drafts of ships are greater
than 12 feet when their fuel capacity is more than
4,500 gallons.

Information-Source Abstractions:

R1: (:if (:and (notional_ship ?ship)
               (notional_ship.max_draft ?ship ?draft)
               (notional_ship.range ?ship ?range)
               (> ?draft 10))
         (:then (> ?range 12000)))

R2: (:if (:and (notional_ship ?ship)
               (notional_ship.range ?ship ?range)
               (notional_ship.fuel_cap ?ship ?fuel_cap)
               (> ?range 10000))
         (:then (> ?fuel_cap 5000)))

R3: (:if (:and (notional_ship ?ship)
               (notional_ship.max_draft ?ship ?draft)
               (notional_ship.fuel_cap ?ship ?fuel_cap)
               (> ?fuel_cap 4500))
         (:then (> ?draft 12)))

Figure  11:  Applicable  Rules  in  the  Information- 
Source  Abstractions 

Based on these rules, the reformulation component
infers a set of additional constraints and
merges them with the input query. The resulting
query is the first query shown in Figure 12.
This query is semantically equivalent to the input
query but is not necessarily more efficient. The set
of constraints in this resulting query is called the
inferred set. The system then selects a subset
of the constraints in the inferred set to complete
the reformulation. The selection is based on two
criteria: reducing the total evaluation cost and
retaining semantic equivalence. A detailed
description of the algorithm is given in [Hsu and Knoblock,
1993b]. In this example, the input query is reformulated
into a new query where the constraint on
the attribute range is replaced with a constraint
on the attribute max_draft, which turns out to be
cheap to access because of the way the information
is indexed. The reformulated query can therefore
be evaluated more efficiently.

Query with inferred set:

(retrieve (?ship-type ?ship ?draft)
  (:and (notional_ship ?ship)
        (notional_ship.sht_nm ?ship ?ship-type)
        (notional_ship.max_draft ?ship ?draft)
        (notional_ship.range ?ship ?range)
        (notional_ship.fuel_cap ?ship ?fuel_cap)
        (> ?range 10000)
        (> ?fuel_cap 5000)
        (> ?draft 12)))

Reformulated Query:

(retrieve (?ship-type ?ship ?draft)
  (:and (notional_ship ?ship)
        (notional_ship.sht_nm ?ship ?ship-type)
        (notional_ship.max_draft ?ship ?draft)
        (> ?draft 12)))

Figure 12: Reformulated Query
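
The inference and selection steps of this example can be sketched in Python. This is a simplified illustration, not the algorithm of [Hsu and Knoblock, 1993b]: it assumes that every rule and constraint is a strict lower bound on a single attribute, it only considers replacing the query with one inferred constraint, and the COST table is a made-up stand-in for the real cost estimates.

```python
from typing import Dict, List, Tuple

# A rule "if attr_a > bound_a then attr_b > bound_b" (representation is illustrative).
Rule = Tuple[str, float, str, float]

RULES: List[Rule] = [
    ("max_draft", 10, "range", 12000),    # R1
    ("range", 10000, "fuel_cap", 5000),   # R2
    ("fuel_cap", 4500, "max_draft", 12),  # R3
]

# Hypothetical evaluation costs: max_draft is indexed, the other attributes are not.
COST: Dict[str, float] = {"range": 100.0, "fuel_cap": 100.0, "max_draft": 1.0}


def closure(bounds: Dict[str, float], rules: List[Rule]) -> Dict[str, float]:
    """Forward-chain the rules to a fixpoint, keeping the strongest lower bounds."""
    known = dict(bounds)
    changed = True
    while changed:
        changed = False
        for attr_a, bound_a, attr_b, bound_b in rules:
            # x > known[attr_a] and known[attr_a] >= bound_a imply x > bound_a.
            fires = attr_a in known and known[attr_a] >= bound_a
            if fires and known.get(attr_b, float("-inf")) < bound_b:
                known[attr_b] = bound_b
                changed = True
    return known


def inferred_set(query: Dict[str, float], rules: List[Rule]) -> Dict[str, float]:
    """Query constraints plus derived ones; the query's own bounds are kept as written."""
    return {**closure(query, rules), **query}


def implies(bounds: Dict[str, float], query: Dict[str, float], rules: List[Rule]) -> bool:
    """True if the given constraints entail every constraint of the query."""
    derived = closure(bounds, rules)
    return all(derived.get(attr, float("-inf")) >= b for attr, b in query.items())


def reformulate(query: Dict[str, float], rules: List[Rule]) -> Dict[str, float]:
    """Pick the cheapest single inferred constraint that is equivalent to the query."""
    candidates = [
        {attr: bound}
        for attr, bound in inferred_set(query, rules).items()
        if implies({attr: bound}, query, rules)
    ]
    return min(candidates, key=lambda c: sum(COST[a] for a in c), default=dict(query))


query = {"range": 10000}
print(inferred_set(query, RULES))
print(reformulate(query, RULES))
```

Every inferred constraint follows from the query, so a candidate that also entails the query is equivalent to it under the rules. With the assumed costs, the sketch keeps only the constraint on max_draft, as in Figure 12.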

The  reformulation  is  not  limited  to  removing 
constraints.  There  are  cases  when  the  system  can 
reformulate  a  query  by  adding  new  constraints 
or  proving  that  the  query  is  unsatisfiable.  The 
inferred  set  turns  out  to  be  useful  information 
for  extending  the  algorithm  to  reformulate  an  en¬ 
tire  query  plan.  Previous  work  only  reformulates 
single  database  queries.  In  addition,  our  algorithm 
is  polynomial  in  terms  of  the  number  of  infor¬ 
mation  source  abstraction  rules  and  the  syntactic 
length  of  the  input  query.  A  large  number  of  rules 
may  slow  down  the  reformulation.  In  this  case, 
we  can  adopt  sophisticated  indexing  and  hashing 
techniques  in  rule  matching,  or  constrain  the  size 
of  the  information-source  abstractions  by  remov¬ 
ing  information-source  abstractions  with  low  util¬ 
ity. 
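
As a rough illustration of these two options, the sketch below indexes rules by the attribute tested in their antecedent, so that matching only touches rules that could fire, and prunes the rule set to the highest-utility rules. The tuple representation of rules and the utility scores are assumptions made for this example; the paper does not specify how utility is measured.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# "if attr_a > bound_a then attr_b > bound_b" (illustrative representation).
Rule = Tuple[str, float, str, float]

R1: Rule = ("max_draft", 10, "range", 12000)
R2: Rule = ("range", 10000, "fuel_cap", 5000)
R3: Rule = ("fuel_cap", 4500, "max_draft", 12)


def build_index(rules: List[Rule]) -> Dict[str, List[Rule]]:
    """Hash each rule under the attribute that its antecedent constrains."""
    index: Dict[str, List[Rule]] = defaultdict(list)
    for rule in rules:
        index[rule[0]].append(rule)
    return index


def matching_rules(index: Dict[str, List[Rule]], bounds: Dict[str, float]) -> List[Rule]:
    """Return the rules whose antecedent is satisfied by the query's lower bounds."""
    return [
        rule
        for attr, bound in bounds.items()
        for rule in index.get(attr, [])
        if bound >= rule[1]
    ]


def prune(rules: List[Rule], utility: Dict[Rule, float], keep: int) -> List[Rule]:
    """Keep the `keep` rules with the highest (hypothetical) utility scores."""
    ranked = sorted(rules, key=lambda r: utility.get(r, 0.0), reverse=True)
    return ranked[:keep]


index = build_index([R1, R2, R3])
print(matching_rules(index, {"range": 10000}))
print(prune([R1, R2, R3], {R1: 0.1, R2: 0.9, R3: 0.5}, keep=2))
```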

We can reformulate each subquery in the query
plan with the subquery reformulation algorithm
to improve its efficiency. However, the most
expensive aspect of queries to multiple information
sources is often processing intermediate data.
In the example query plan in Figure 9, the constraint
on the final subquery involves the variables
?draft and ?depth that are bound in the
preceding subqueries. If we can reformulate these
preceding subqueries so that they retrieve only the
data instances possibly satisfying the constraint
(< ?draft ?depth) in the final subquery, the
intermediate data will be reduced. This requires
the query plan reformulation algorithm to be able
to propagate the constraints along the data flow
paths in the query plan. We developed a query
plan reformulation algorithm which achieves this
by updating the information-source abstractions
and rearranging constraints. We explain the
algorithm using the query plan in Figure 9.

The algorithm first reformulates each subquery
in the partial order (i.e., the data flow order)
specified in the plan. The two subqueries to
information sources are reformulated first. The
information-source abstractions are updated and
saved in Inferred-Set, which is returned from
the subquery reformulation to propagate the
constraints to later subqueries. For example, when
reformulating the subquery on notional_ship,
(> ?draft 12) is inferred and saved in the inferred
set. In addition, the constraint (> ?range
10000) in the original subquery is propagated
along the data flow path to its succeeding subquery.
Similarly, the system can infer the range of
?depth in this manner. In this case, the range of
?depth is 41 < ?depth < 60.

Now that the updated ranges for ?draft and
?depth are available, the subquery reformulation
algorithm can infer from the constraint (< ?draft
?depth) a new constraint (< ?draft 60) and
add it to the subquery for the join operation. However,
this constraint should be placed on the remote
subquery instead of the local Loom query
because it only depends on the data in the remote
information source. In this case, when updating
the query plan with the reformulated subquery,
the algorithm locates where the constrained
variable of each new constraint is bound, and inserts
the new constraint in the corresponding subqueries.
In our example, the variable is bound by
(max_draft ?ship ?draft) in the subquery on
notional_ship in Figure 9. The algorithm will
insert the new constraint on ?draft in that subquery.
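
The propagation step in this example amounts to simple interval reasoning, sketched below under stated assumptions. The Interval class, the function names, and the subquery name "depth_source" are hypothetical; only notional_ship, the variables, and the numeric ranges come from the example.

```python
import math
from dataclasses import dataclass
from typing import Dict, Optional, Set, Tuple


@dataclass(frozen=True)
class Interval:
    """Open interval lo < value < hi describing what is known about a variable."""
    lo: float = -math.inf
    hi: float = math.inf


def propagate_less_than(x: Interval, y: Interval) -> Tuple[Interval, Interval]:
    """Tighten the ranges of x and y under the join constraint x < y."""
    new_x = Interval(x.lo, min(x.hi, y.hi))  # x < y < y.hi
    new_y = Interval(max(y.lo, x.lo), y.hi)  # y > x > x.lo
    return new_x, new_y


def binding_subquery(var: str, bound_in: Dict[str, Set[str]]) -> Optional[str]:
    """Find the subquery in which a variable is bound, so the constraint can go there."""
    for subquery, variables in bound_in.items():
        if var in variables:
            return subquery
    return None


# Ranges inferred while reformulating the two subqueries to information sources.
draft = Interval(lo=12)          # (> ?draft 12)
depth = Interval(lo=41, hi=60)   # 41 < ?depth < 60

new_draft, new_depth = propagate_less_than(draft, depth)

# Which subquery binds which variables; "depth_source" is a made-up name.
BOUND_IN = {
    "notional_ship": {"?ship", "?ship-type", "?draft"},
    "depth_source": {"?depth"},
}
target = binding_subquery("?draft", BOUND_IN)
print(new_draft, target)
```

The new upper bound on ?draft is attached to the subquery returned by binding_subquery rather than to the local join, which is what reduces the intermediate data.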

The semantics of the modified subqueries, such
as the subquery on notional_ship in this example,
are changed because of the newly inserted
constraints. However, the semantics of the overall
query plan remain the same. After all the
subqueries in the plan have been reformulated,
the system reformulates these modified subqueries
again to improve their efficiency. In our example,
the subquery reformulation algorithm is applied
again to the notional_ship subquery. This time,
no reformulation is found to be appropriate. The
reformulated subquery of the final query plan is
shown in Figure 13.

Reformulated Subquery:

(retrieve (?ship-type ?ship ?draft)
  (:and (notional_ship ?ship)
        (notional_ship.sht_nm ?ship ?ship-type)
        (notional_ship.max_draft ?ship ?draft)
        (> ?draft 12)
        (< ?draft 60)))

Figure 13: Reformulated Query

The resulting query plan is more efficient and
returns the same answer as the original one. In
our example, the subquery to notional_ship is
more efficient because the constraint on the
attribute range is replaced with another constraint
that can be evaluated more efficiently. The
intermediate data are reduced because of the new
constraint on the variable ?draft. The logical
rationale of this new constraint is derived from
the constraints in the other two subqueries, (>
?range 10000) and (< ?draft ?depth), and the
rules in the information-source abstractions. The
entire algorithm for query plan reformulation is
still polynomial. Our experiments show that the
overhead of reformulation is very small compared
to the overall query processing cost. On a set of 32
example queries, the query reformulation yielded
significant performance improvements with an
average reduction in execution time of 43%.

Learning 

An  intelligent  agent  for  information  retrieval 
should  be  able  to  improve  its  performance  over 
time.  To  achieve  this  goal,  the  information  agents 
currently  support  two  forms  of  learning.  First, 
they  have  the  capability  to  cache  frequently  re¬ 
trieved  or  difficult  to  retrieve  information.  Second, 
for  those  cases  where  caching  is  not  appropriate, 
an  agent  can  learn  about  the  contents  of  the  infor¬ 
mation  sources  in  order  to  minimize  the  costs  of  re¬ 
trieval.  Since  information  retrieval  agents  serve  as 
information  sources  for  other  agents,  both  caching 
and  learning  can  be  applied  to  information  agents 
as  well  as  data  and  knowledge  bases.  This  section 
describes  these  two  forms  of  learning. 

Caching  Retrieved  Data 

Data that is required frequently or is very expensive
to retrieve can be cached in the local agent and
then retrieved more efficiently. An elegant feature
of using Loom to model the domain is that cached
information can easily be represented and stored in
Loom. The data is currently brought into the local
agent for processing, so caching is simply a
matter of retaining the data and recording what
data has been retrieved.

Caching retrieved data in the local agent requires
formulating a description of the data so that it
can be used to answer future queries. This description
can be extracted from the initial query, which is already
expressed in the form of a domain-level description
of the desired data. The description defines
a new subconcept, which is placed in the appropriate
place in the concept hierarchy. The data
then become instances of this concept and can be
accessed by retrieving all of its instances.

Once the system has defined a new class and
stored the data under this class, the cached information
becomes a new information source for the
agent. The reformulation operations, which map
a  domain  query  into  a  set  of  information  source 
queries,  will  automatically  consider  this  new  in¬ 
formation  source.  Since  the  system  takes  the  re¬ 
trieval  costs  into  account  in  selecting  the  infor¬ 
mation  sources,  it  will  naturally  gravitate  towards 
using  cached  information  where  appropriate.  In 
those  cases  where  the  cached  data  does  not  cap¬ 
ture  all  of  the  required  information,  it  may  still 
be  cheaper  to  retrieve  everything  from  the  remote 
site.  However,  in  those  cases  where  the  cached  in¬ 
formation  can  be  used  to  avoid  an  external  query, 
the  use  of  the  stored  information  can  provide  sig¬ 
nificant  efficiency  gains. 
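
The idea of treating cached data as a described information source can be sketched as follows. This is an illustration only: the system represents the cache as Loom concepts, whereas the sketch uses a hypothetical CachedClass record whose description is a set of lower bounds, and it assumes that the cached rows store every attribute a query constrains.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

Row = Dict[str, object]


@dataclass
class CachedClass:
    """Cached data plus the description (concept and lower bounds) that defines it."""
    concept: str
    lower_bounds: Dict[str, float]  # every cached row satisfies attr > bound
    rows: List[Row]


def covers(cached: CachedClass, concept: str, query_bounds: Dict[str, float]) -> bool:
    """True if every answer to the query is guaranteed to be among the cached rows."""
    return cached.concept == concept and all(
        attr in query_bounds and query_bounds[attr] >= bound
        for attr, bound in cached.lower_bounds.items()
    )


def answer_from_cache(
    cache: List[CachedClass], concept: str, query_bounds: Dict[str, float]
) -> Optional[List[Row]]:
    """Answer locally when a cached class covers the query; otherwise return None."""
    for cached in cache:
        if covers(cached, concept, query_bounds):
            # Assumes cached rows store every attribute that the query constrains.
            return [
                row
                for row in cached.rows
                if all(row[attr] > bound for attr, bound in query_bounds.items())
            ]
    return None  # the caller falls back to the remote information source


# Result of an earlier query for ships with range > 10000 (made-up data).
cache = [
    CachedClass(
        "notional_ship",
        {"range": 10000},
        [{"ship": "A", "range": 11000}, {"ship": "B", "range": 15000}],
    )
]
narrower = answer_from_cache(cache, "notional_ship", {"range": 12000})
broader = answer_from_cache(cache, "notional_ship", {"range": 5000})
print(narrower, broader)
```

A narrower query is answered by filtering the cached instances, while a broader one is not covered and must go to the remote site.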

The use of caching raises a number of important
questions, such as which information should be
cached and how the cached information is kept
up-to-date. We are exploring caching schemes where,
rather than caching the answer to a specific query,
general classes of frequently used information are
stored. This is especially useful in the Internet
environment where a single query can be very expensive
and the same set of data is often used to
answer multiple queries. To avoid problems of
information becoming out of date, we have focused
on caching relatively static information.

Learning  about  the  Contents  of 
Information  Sources 

The  agent’s  goal  is  to  provide  efficient  access  to 
a  set  of  information  sources.  Since  accessing  and 
processing  information  can  be  very  costly,  the  sys¬ 
tem  strives  for  the  best  performance  that  can  be 
provided  with  the  resources  available.  This  means 
that  when  it  is  not  processing  queries,  it  gathers 
information  to  aid  in  future  retrieval  requests.  The 
information  agents  improve  performance  by  learn¬ 
ing  about  the  contents  of  the  information  sources 
[Hsu  and  Knoblock,  1993a]. 

The  learning  is  triggered  when  an  agent  detects 
an  excessively  expensive  query.  In  this  way,  the 
agent  will  incrementally  gather  a  set  of  rules  to 
reformulate  expensive  queries.  The  learning  sub¬ 
system  uses  induction  on  the  contents  of  the  infor¬ 
mation  sources  to  construct  a  less  expensive  spec¬ 
ification  of  the  original  query.  This  new  query  is 
then  compared  with  the  original  to  generate  a  set 
of  rules  that  describe  the  relationships  between  the 
two  equivalent  queries.  The  learned  rules  are  in¬ 
tegrated  into  the  agent’s  domain  model  and  then 
used  for  semantic  query  reformulation. 
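
To make the induction step concrete, the sketch below looks for a threshold on a cheaply accessible attribute that selects exactly the instances selected by an expensive constraint. It is a deliberately simple stand-in for the learning method of [Hsu and Knoblock, 1993a], whose details are not reproduced here; the function name and the sample data are hypothetical, and a rule found this way is only known to hold for the current contents of the source.

```python
from typing import Dict, List, Optional

Row = Dict[str, float]


def induce_equivalent_bound(
    rows: List[Row], target_attr: str, target_bound: float, cheap_attr: str
) -> Optional[float]:
    """Find t such that (cheap_attr > t) selects the same rows as (target_attr > target_bound)."""
    positives = [r for r in rows if r[target_attr] > target_bound]
    negatives = [r for r in rows if not r[target_attr] > target_bound]
    if not positives or not negatives:
        return None  # nothing useful to learn from this query
    highest_negative = max(r[cheap_attr] for r in negatives)
    lowest_positive = min(r[cheap_attr] for r in positives)
    if highest_negative < lowest_positive:
        # Every positive exceeds the threshold and no negative does.
        return highest_negative
    return None  # the cheap attribute cannot separate the two groups


# Made-up contents of a ship table.
ships = [
    {"range": 8000, "max_draft": 9},
    {"range": 9000, "max_draft": 11},
    {"range": 13000, "max_draft": 14},
    {"range": 15000, "max_draft": 20},
]

threshold = induce_equivalent_bound(ships, "range", 10000, "max_draft")
print(threshold)
```

Comparing the original constraint on range with the induced constraint on max_draft yields a pair of rules, one in each direction, of the kind shown in Figure 11.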

Advantages  of  the  Architecture 

Now  that  we  have  described  the  basic  architec¬ 
ture,  this  section  first  reviews  the  critical  features 
of  this  architecture  and  then  describes  the  advan¬ 
tages  provided  by  these  features. 

The  critical  features  of  this  architecture  that 
support  multiple  cooperating  agents  are: 


1. A uniform query language that is used as the
interface for the user as well as the interface
between agents.

2.  A  unified  model  of  the  domain  and  sepa¬ 
rate  models  of  the  contents  of  the  information 
sources. 

3.  The  dynamic  selection  of  an  appropriate  set  of 
information  sources. 

4.  The  generation  of  parallel  query  access  plans. 

5. The use of semantic knowledge to optimize the
query plans.

6.  A  learning  system  that  improves  the  perfor¬ 
mance  of  an  agent  by  caching  frequently  used 
information  and  learning  about  the  contents  of 
the  information  sources. 

First,  the  uniform  query  language  and  separate 
models  provide  a  modular  architecture  for  mul¬ 
tiple  information  agents.  An  information  agent 
for  one  domain  can  serve  as  an  information  source 
to other information agents. This can be done
seamlessly since the interface to every information
source is exactly the same: it takes a query
in a uniform language (i.e., Loom) as input and
returns the data requested by the query. The
domain model provides a uniform language for
queries about information in any of the sources
to which an agent has access. The contents of
each agent are represented as a separate information
source and are mapped to the domain model of an
agent. Each information agent can export some
or  all  of  its  domain  model,  which  can  be  incor¬ 
porated  into  another  information  agent’s  model. 
This  exported  model  forms  the  shared  terminol¬ 
ogy  between  agents. 
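
The modularity argument can be illustrated with a small sketch in which an agent and a database expose the same interface, so that an agent can be plugged in wherever a source is expected. The class names are hypothetical, and the dictionary used as a query is only a stand-in for a Loom query, not the actual query language.

```python
from typing import Dict, List, Protocol

Query = Dict[str, object]  # stand-in for a Loom query: attribute -> required value
Row = Dict[str, object]


class InformationSource(Protocol):
    """Anything that accepts a query in the uniform language and returns data."""

    def query(self, q: Query) -> List[Row]: ...


class Database:
    """A leaf information source holding rows directly."""

    def __init__(self, rows: List[Row]) -> None:
        self.rows = rows

    def query(self, q: Query) -> List[Row]:
        return [r for r in self.rows if all(r.get(k) == v for k, v in q.items())]


class Agent:
    """An agent answers queries from its sources and is itself usable as a source."""

    def __init__(self, sources: List[InformationSource]) -> None:
        self.sources = list(sources)

    def query(self, q: Query) -> List[Row]:
        results: List[Row] = []
        for source in self.sources:
            results.extend(source.query(q))
        return results


ships = Database([{"kind": "ship", "name": "A"}, {"kind": "ship", "name": "B"}])
ports = Database([{"kind": "port", "name": "P"}])

transport_agent = Agent([ships, ports])
top_agent = Agent([transport_agent])  # an agent used as another agent's source
print(top_agent.query({"kind": "port"}))
```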

Second,  the  separate  domain  and  information 
source  models  and  the  dynamic  information  source 
selection  make  the  overall  architecture  easily  ex¬ 
tensible.  Adding  a  new  information  source  sim¬ 
ply  requires  building  a  model  of  the  information 
source  that  describes  the  contents  of  the  informa¬ 
tion  source  as  well  as  how  it  relates  to  the  domain 
model.  It  does  not  require  integrating  the  new 
information  source  model  with  the  other  informa¬ 
tion  source  models  since  the  mapping  between  do¬ 
main  and  information  source  models  is  not  fixed. 
Similarly,  changes  to  the  contents  of  information 
sources  require  only  changing  the  model  of  the 
specific  information  source.  Since  the  selection  of 
the  information  sources  is  performed  dynamically, 
when  an  information  request  is  received,  the  agent 
will  select  the  most  appropriate  information  source 
that  is  currently  available. 

Third,  the  separate  domain  and  information 
source  models  and  the  dynamic  information  source 
selection  also  make  the  agents  very  flexible.  The 
agents  can  choose  the  appropriate  information 
sources  based  on  what  they  contain,  how  quickly 
they can answer a given query, and what resources
are currently available. If a particular information
source or network goes down or if the data
is  available  elsewhere,  the  system  will  retrieve  the 
data  from  sources  that  are  currently  available.  An 
agent  can  take  into  consideration  the  rest  of  the 
processing  of  a  query,  so  that  the  system  can  take 
advantage  of  those  cases  where  retrieving  the  data 
from  one  source  is  much  cheaper  than  another 
source  because  the  remote  system  can  do  more  of 
the  processing.  This  flexibility  also  makes  it  pos¬ 
sible  to  cache  and  reuse  information  without  extra 
work  or  overhead. 
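
A minimal sketch of dynamic source selection is given below, assuming that each source advertises the concepts it contains, an estimated retrieval cost, and whether it is currently reachable. The Source record, its fields, and the cost figures are illustrative assumptions rather than the system's actual model.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass
class Source:
    name: str
    concepts: FrozenSet[str]  # what the source contains
    cost: float               # estimated cost of answering from this source
    available: bool = True    # whether the source can currently be reached


def select_source(sources: List[Source], concept: str) -> Optional[Source]:
    """Choose the cheapest source that is available and contains the concept."""
    candidates = [s for s in sources if s.available and concept in s.concepts]
    return min(candidates, key=lambda s: s.cost, default=None)


remote_db = Source("remote_db", frozenset({"notional_ship", "port"}), cost=50.0)
local_cache = Source("local_cache", frozenset({"notional_ship"}), cost=1.0)
sources = [remote_db, local_cache]

first_choice = select_source(sources, "notional_ship")

local_cache.available = False  # e.g. the cached class has been discarded
fallback = select_source(sources, "notional_ship")
print(first_choice.name, fallback.name)
```

Because the choice is made at query time, a cached copy is used when it is cheaper, and the remote source takes over as soon as the cache is unavailable.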

Fourth,  building  parallel  query  access  plans,  us¬ 
ing  semantic  knowledge  to  optimize  the  plans, 
caching retrieved data, and learning about information
sources provide efficient access to large
numbers  of  information  sources.  The  planner  gen¬ 
erates  plans  that  minimize  the  overall  execution 
time  by  maximizing  the  parallelism  in  the  plan  to 
take  advantage  of  the  fact  that  separate  informa¬ 
tion  sources  can  be  accessed  in  parallel.  The  se¬ 
mantic  query  reformulation  provides  a  global  opti¬ 
mization  step  that  minimizes  the  amount  of  inter¬ 
mediate  data  that  must  be  processed.  The  ability 
to  cache  retrieved  data  allows  an  agent  to  store 
frequently  used  or  expensive-to-retrieve  informa¬ 
tion  in  order  to  provide  the  requested  information 
more  efficiently.  And  the  ability  to  learn  about 
the  contents  of  the  information  sources  allows  the 
agent  to  exploit  time  when  it  would  not  other¬ 
wise  be  used  to  improve  its  performance  on  future 
queries. 
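
The benefit of accessing separate sources in parallel can be sketched with Python's standard thread pool. The two fetch functions and their data are made up for illustration; in the system, the planner decides which subqueries are independent, whereas here that is simply assumed.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

Row = Dict[str, object]


def run_parallel(subqueries: Dict[str, Callable[[], List[Row]]]) -> Dict[str, List[Row]]:
    """Issue independent subqueries concurrently and collect their results by name."""
    with ThreadPoolExecutor(max_workers=max(1, len(subqueries))) as pool:
        futures = {name: pool.submit(fn) for name, fn in subqueries.items()}
        return {name: future.result() for name, future in futures.items()}


def fetch_ships() -> List[Row]:
    """Stand-in for a remote subquery returning ships and their drafts."""
    return [{"ship": "s1", "draft": 30}, {"ship": "s2", "draft": 50}]


def fetch_depths() -> List[Row]:
    """Stand-in for a second remote subquery returning ports and their depths."""
    return [{"port": "p1", "depth": 45}]


results = run_parallel({"ships": fetch_ships, "depths": fetch_depths})

# The final join (< ?draft ?depth) runs locally once both subqueries have returned.
pairs = [
    (s["ship"], d["port"])
    for s in results["ships"]
    for d in results["depths"]
    if s["draft"] < d["depth"]
]
print(pairs)
```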

Related  Work 

A  great  deal  of  work  has  been  done  on  building 
agents  for  various  kinds  of  tasks.  This  work  is 
quite  diverse  and  has  focused  on  a  variety  of  is¬ 
sues.  First,  there  has  been  work  on  multi-agent 
planning  and  distributed  problem  solving,  which 
is  described  in  [Bond  and  Gasser,  1988].  The 
body  of  this  work  deals  with  the  issues  of  coor¬ 
dination,  synchronization,  and  control  of  multi¬ 
ple  autonomous  agents.  Second,  a  large  body  of 
work  has  focused  on  defining  models  of  beliefs, 
intentions,  capabilities,  needs,  etc.,  of  an  agent. 
[Shoham,  1993]  provides  a  nice  example  of  this 
work  and  a  brief  overview  of  the  related  work  on 
this  topic.  Third,  there  is  more  closely  related 
work  on  developing  agents  for  information  gather¬ 
ing. 

The  problem  of  information  gathering  is  also 
quite  broad  and  the  related  work  has  focused  on 
various  issues.  Kahn  and  Cerf  [1988]  proposed  an 
architecture  for  a  set  of  information-management 
agents, called Knowbots. The various agents are
hard-coded to perform particular tasks. Etzioni
et al. [1992, 1994] have built agents for the Unix
domain that can perform a variety of Unix tasks.
This work has focused extensively on reasoning
and planning with incomplete information, which
arises in many of these tasks. Levy et al. [1994]
are  also  working  on  building  agents  for  retrieving 
information  from  the  Internet.  The  focus  of  their 
work  has  been  on  developing  a  formal  framework 
for  selecting  a  minimal  set  of  sites  to  answer  a 
query. 

In contrast to much of this previous work, the
focus  of  our  work  is  on  flexible  and  efficient  re¬ 
trieval  of  information  from  heterogeneous  infor¬ 
mation  sources.  Since  most  of  these  other  systems 
have  in-memory  databases,  they  assume  that  the 
cost  of  a  database  retrieval  is  small  or  negligible. 
One  of  the  critical  problems  when  dealing  with 
large  databases  is  how  to  issue  the  appropriate 
queries  to  efficiently  access  the  desired  informa¬ 
tion.  We  are  focusing  on  the  problems  of  how  to 
organize,  manipulate,  and  learn  about  large  quan¬ 
tities  of  data. 

Research  in  databases  has  also  focused  on  build¬ 
ing  integrated  or  federated  systems  that  com¬ 
bine  information  sources  [Landers  and  Rosenberg, 
1982,  Sheth  and  Larson,  1990].  The  approach 
taken  in  these  systems  is  to  first  define  a  global 
schema,  which  integrates  the  information  available 
in  the  different  information  sources.  However,  this 
approach  is  unlikely  to  scale  to  the  large  number 
of evolving information sources (e.g., the Internet)
since  building  an  integrated  schema  is  labor  inten¬ 
sive  and  difficult  to  maintain,  modify,  and  extend. 

The Carnot project [Collet et al., 1991] also integrates
heterogeneous databases using a knowledge
representation system. Carnot uses a knowledge
base to build a set of articulation axioms that describe
how to map between SQL queries and domain
concepts. After the axioms are built, the domain
model is no longer used or needed. In contrast,
the domain model of one of our agents is an
integral part of the system, and allows an agent
both to combine information stored in the knowledge
base and to reformulate queries.

Conclusion 

This  paper  described  the  SIMS  architecture  for 
intelligent  information  retrieval  agents.  This  par¬ 
ticular  architecture  has  a  number  of  important 
features:  (1)  modularity  in  terms  of  representing 
an  information  agent  and  information  sources,  (2) 
extensibility  in  terms  of  adding  new  information 
agents  and  information  sources,  (3)  flexibility  in 
terms of selecting the most appropriate information
sources to answer a query, and (4) efficiency
in  terms  of  minimizing  the  overall  execution  time 
for  a  given  query. 

To  date,  we  have  built  information  agents  that 
plan  and  learn  in  the  transportation  planning  do¬ 
main.  These  agents  contain  a  detailed  model  of 
this  domain  and  extract  information  from  a  set 
of  nine  relational  databases.  The  agents  select 
appropriate  information  sources,  generate  paral¬ 
lel  plans,  execute  the  queries  in  parallel,  and  learn 

Future work will focus on extending the planning
and learning capabilities described in this
paper. An important issue that we have not
yet addressed is how to handle the various forms
of incompleteness and inconsistency that will inevitably
arise from using autonomous information
sources. Our plan is to address these issues by exploiting
available domain knowledge and employing
more sophisticated planning and reasoning capabilities
to both detect and recover from these
problems.